Table~\ref{tab:design-space} describes the various dimensions we evaluate in the 3D-device-heterogeneous architecture space.

\begin{table}[ht!]\small
\centering
\begin{center}
\begin{tabular}{|c|c|} \hline
%Workload Type & \\
CMOS Technology & 22~nm Si FinFET \\	\hline
CMOS Frequency Range & 0.5--3~GHz  \\	\hline
TFET Frequency Range & 0.5--1.75~GHz  \\	\hline
TFET Technology & 22~nm HTFET  \\	\hline
Number of layers & 1--8	\\\hline
Total number of cores & 1--128	\\ \hline
Number of utilizable cores & 1--64 \\ \hline
Thermal limit	& 360~K (air cooled)	\\ \hline
Ambient temperature 	& 300~K	\\ \hline
% & per core; 4MB shared LLC 
% & L1 hit latency: 1 cycles; L2 hit latency: 8 cycles\\ \hline
%Memory & 4GB; DDR2-1600; 1 memory channel; \hline
\end{tabular}
\caption {Configuration of the evaluation platform.}
\label{tab:design-space}
\end{center}
\vspace{-0.3in}
\end{table}


In this section, we analyze the effect of each of these dimensions on thermally constrained performance.

\subsection{Technology variation}
Our baseline device models are derived from 20~nm technology node simulations
calibrated with fabricated devices~\cite{dac11}. In addition, we use
TFET device scaling~\cite{Lu-tfet-scaling} to model technologies at
the 10~nm node (where we expect TFETs to become commercially
available).

% Hence our device models were calibrated to correspond to this model. 
%However Liu \emph{et. al}~\cite{lu-tfet-scaling} have projected TFET device scal%ing characteristics down to the 10~nm node.
%Figure (technology scaling-Huichu) shows the CMOS-TFET tradeoffs, when we compare Si FinFET and TFET device delays at 22~nm, 14~nm, 10~nm technology nodes. Using the processor-level abstraction models discussed in Section~\ref{sec:background}, we arrive at an estimate of the performance and power of these processors.
%Currently, at the 22nm node, we propose using heterogeneous multicores comprising of both CMOS and TFET processors, in order to cater to all classes of applications.
%However, by the 10~nm node, we estimate that TFET-base processors can prove to be a viable replacement for CMOS.


Figure~\ref{fig:technology-scaling3a} shows the scaling of the critical path delay when extrapolated to future technology nodes. 
Comparisons are made between the ITRS 2012~\cite{itrs2011} roadmap projections for Si FinFET and simulation results for HTFET.
%As part of the simulation, Si FinFET and HTFET models for 22nm technology node have been calibrated with experimental data. % respectively [reference of model calibration in previous DATE paper]. 
The HTFET device models for 14~nm and 10~nm technology nodes are generated from simulations using the TCAD Sentaurus device modeling tool~\cite{tcad-sentaurus}. 
The supply voltage $V_{cc}$ at each technology node is $0.72$~V ($22$~nm node), $0.67$~V ($14$~nm node) and $0.55$~V ($10$~nm node) for Si FinFET technology, and $0.4$~V ($22$~nm node), $0.35$~V ($14$~nm node) and $0.3$~V ($10$~nm node) for HTFET technology.
%Since there is no ITRS roadmap for tunneling devices, we assume the same relative scaling ratio observed in our device simulations between CMOS and TFET in order to obtain a projection for TFET consistent with the ITRS roadmap. %fixme
%These projections are optimistic in comparison with the TCAD simulation models, since the drive current assumed by the ITRS projections is much higher than the simulated value, due to changes in the charge transport mechanisms at shortened gate lengths. %check with Huichu
%following ITRS low operating power (LOP) target with fixed leakage of 5nA/um. 
In a similar manner, Figure~\ref{fig:technology-scaling3b} shows the scaling of the total core power for each of the above technology nodes. %In this case, the simulation results closely match the ITRS projections.

% The supply voltage of TFET cores scales from 0.4V at 22~nm to 0.3V at 10~nm, while the FinFETs are operated at supply $V_{dd}$ values corresponding to the ITRS 2012 roadmap projections for these technologies. (cite ITRS).
From our models, we observe that one of the major limitations with TFET processors at the 22~nm node is their relatively low peak performance.
The saturating nature of TFET tunneling current restricts the peak frequency to around 1.5--1.6~GHz, which is far below what CMOS processors can attain.
However, the minimum switching delay of the device reduces with subsequent generations, enabling TFET processors to operate at much higher frequencies.
Although FinFET switching delay decreases proportionally as well, wire delays do not scale, so the frequency gap between CMOS and TFET processors shrinks with every generation.
As the gap in critical path delay between CMOS and TFET narrows, TFET cores can attain 95\% of the peak performance of CMOS by the 10~nm node, compared to 60\% at the current 22~nm node. Further, TFETs become increasingly power efficient relative to CMOS with each subsequent generation.
Thus, the range of applications for which TFETs can act as a viable replacement keeps expanding as transistors continue to scale.
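
The narrowing gap can be illustrated with a first-order sketch. Assume, as a simplification that is not part of our device models, that peak frequency is the reciprocal of the critical path delay $t_{\mathrm{crit}} = t_{\mathrm{logic}} + t_{\mathrm{wire}}$, and that both technologies see the same wire delay $t_{\mathrm{wire}}$. All symbols here are introduced only for illustration. Then
\[
\frac{f_{\mathrm{TFET}}}{f_{\mathrm{CMOS}}}
  = \frac{t_{\mathrm{logic,CMOS}} + t_{\mathrm{wire}}}{t_{\mathrm{logic,TFET}} + t_{\mathrm{wire}}} .
\]
If both logic delays shrink by a common factor $s < 1$ while $t_{\mathrm{wire}}$ stays fixed, the ratio becomes
\[
\frac{s\, t_{\mathrm{logic,CMOS}} + t_{\mathrm{wire}}}{s\, t_{\mathrm{logic,TFET}} + t_{\mathrm{wire}}} ,
\]
which moves toward $1$ as $s$ decreases whenever $t_{\mathrm{logic,TFET}} > t_{\mathrm{logic,CMOS}}$.
If peak performance is taken to track peak frequency, the ratios of $0.60$ at 22~nm and $0.95$ at 10~nm correspond to TFET critical paths that are roughly $1/0.60 \approx 1.67\times$ and $1/0.95 \approx 1.05\times$ as long as their CMOS counterparts.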

%\begin{figure}[ht!]
%\begin{minipage}[b]{1\linewidth}
%  \centering
%    \epsfig{file=figs/technology_scaling3a.eps, angle=0, width=1\linewidth, clip=}
%    \caption{\footnotesize\label{fig:technology-scaling3a} a) Variation of total critical path delay (including logic and wire delay) in a CMOS and TFET processor at 22nm, 14nm and 10nm technology nodes. }
%\end{minipage}
%\vspace{0.1in}
%\begin{minipage}[b]{1\linewidth}
%\begin{figure}[ht!]
%  \centering
%    \epsfig{file=figs/technology_scaling3.eps, angle=0, width=1\linewidth, clip=}
%    \caption{\footnotesize\label{fig:technology-scaling3b} b) Variation of total core power (including logic and wire power) in a CMOS and TFET processor at 22nm, 14nm and 10nm technology nodes. }
%\end{minipage}
%\end{figure}

\begin{figure}[ht]
\begin{minipage}[b]{1\linewidth}
\centering
    \epsfig{file=figs/technology_scaling3a.eps, angle=0, width=0.9\linewidth, clip=}
    \caption{\label{fig:technology-scaling3a} Variation of total critical path delay (logic + wire delay) in FinFET and TFET  processors at 22, 14 and 10~nm technology nodes. Scaling is demonstrated both for the ITRS roadmap projections and TCAD simulations for FinFET and TFET.}
\end{minipage}
%\vspace{0.2cm}
\begin{minipage}[b]{1\linewidth}
\centering
    \epsfig{file=figs/technology_scaling3b.eps, angle=0, width=0.9\linewidth, clip=}
    \caption{\label{fig:technology-scaling3b} Variation of total core power (including logic and wire power) in a CMOS and TFET processor at 22~nm, 14~nm and 10~nm technology nodes. }
\label{fig:figure2}
\end{minipage}
\end{figure}

\subsection{Yield aware stacking of processors}

As explained in Section~\ref{sec:motivation}, processor yield decreases super-linearly with increasing die area.
As shown in Equation~21 of~\cite{yibo-yield-iccad}, the yield varies with area as a Gamma function.
While reducing the area footprint by stacking cores can improve the die yield, there are losses from joining two layers together, quantified by the \emph{bonding yield}, as shown in Equation~24 of~\cite{yibo-yield-iccad}.
As a result, there is a tradeoff between increasing the die size and increasing the number of layers.
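
The two opposing trends can be sketched with a commonly used yield form. We use the negative-binomial die yield model, which arises from a Gamma-distributed defect density, as an assumed stand-in; it may differ in detail from Equations~21 and~24 of~\cite{yibo-yield-iccad}. With total core area $A$ folded into $N$ layers, mean defect density $D_0$, and clustering parameter $\alpha$ (all symbols are ours), the yield of one layer is
\[
Y_{\mathrm{die}}\!\left(\frac{A}{N}\right) = \left(1 + \frac{A\,D_0}{N\,\alpha}\right)^{-\alpha} ,
\]
which increases with $N$. If each of the $N-1$ bonding steps succeeds independently with probability $Y_{\mathrm{bond}}$, the bonding contribution is
\[
Y_{\mathrm{bond}}^{\,N-1} ,
\]
which decreases with $N$. How the two terms combine into an overall stack yield depends on assumptions such as whether dies are tested before bonding, which this sketch leaves open.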

%Phil Emma's stuff
%This tradeoff is illustrated in Figure~\ref{fig:yield-distribution}. 
%In this figure, we observe that the yield reduces drastically beyond 8 cores and 4 layers. 
In order to counteract this yield variation for a multilayered processor system, we consider the use of redundant or spare cores.
For a given multicore system, we consider only a subset of the total cores on chip to be operational. 
The remaining cores are used to ensure that a minimum yield requirement of 50\% is met.
This is known as \emph{core sparing} and is commonly used in industry to improve the overall yield of processors~\cite{emma-3d}.
Although additional hardware resources are spent on these spare cores, the improvement in yield significantly shortens the time to market for these processors. From both a time and a cost perspective, this is a far more viable alternative than improving the fabrication process.
Adding a single spare core to an 8-core system can reduce the time to market by nearly 50\%. However, the number of spare cores needed to meet the yield criterion increases with the total number of cores.
For larger numbers of cores and layers, the yield drops drastically, resulting in more than 50\% of the cores being used for redundancy.
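
The effect of sparing can be sketched with a simple binomial model. Assume, as a simplification of the yield model we actually use, that each core is functional independently with probability $y$, that the chip contains $n$ cores, and that $k$ of them must work. The symbols and the value of $y$ below are hypothetical. The chip yield is then
\[
Y_{\mathrm{sys}}(n,k) = \sum_{i=k}^{n} {n \choose i}\, y^{i}\,(1-y)^{n-i} ,
\]
and the smallest $n$ for which $Y_{\mathrm{sys}}(n,k) \geq 0.5$ gives the number of spares $n-k$.
For example, with $y = 0.9$ and $k = 8$, a chip without spares yields $0.9^{8} \approx 0.43$, which misses the 50\% requirement, whereas one spare core ($n = 9$) gives $0.9^{9} + 9 \cdot 0.9^{8} \cdot 0.1 \approx 0.775$.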

Figure~\ref{fig:useful-cores-sampled} shows the number of stacked layers required to obtain a particular number of cores for different area footprints. 
The redundancy ratio is defined as the fraction of excess cores required to meet the yield threshold.
It can be observed that both area footprint and number of layers cause this redundancy ratio to increase. 
For smaller areas the reduction in yield due to bonding is a more dominant characteristic, as indicated by the increase in redundancy ratio for more stacked layers. 
However, as the area per layer increases, the yield decreases at a faster rate and folding the cores to stack them in multiple layers can arrest this decline.
The maximum area footprint considered is 400~mm$^{2}$ per layer.
In order to meet the yield constraint, it is essential to increase the number of layers to accommodate the redundant cores. 
This adversely affects the thermal behavior of the processor, further constraining the design space.

An important advantage that TFET technology has over other emerging devices is that it is compatible with the CMOS fabrication process~\cite{tfet-intel}.
Further, the process steps involved in the manufacture of TFET processors are similar to those of CMOS. As a result, we assume that this technology is similarly affected by process variation and displays a similar yield to CMOS~\cite{tfet-sram}. 
In order to account for uncertainties due to the new technology, we ran yield experiments for TFET processors by reducing the baseline yield by 5\% and 10\%. The reduction in overall yield could be compensated by adding redundancy of 8\% and 17\%, respectively, without compromising the feasible design space for TFET processors.

\begin{figure}[ht!]
  \centering
    \epsfig{file=figs/useful_cores_sampled.eps, angle=0, width=1\linewidth, clip=}
    \caption{\label{fig:useful-cores-sampled} Number of core layers required to realize a range of functioning cores for different area footprints. The fraction of redundant cores can be seen to increase both with area and with number of layers.}
\end{figure}


%Figure~\ref{redundant cores} shows the number of redundant cores for different number of layers and cores per layer.


\subsection{Modeling thermal distribution across multicores}

The thermal profile of the processor affects its reliability and lifetime, and also determines the cost of cooling and packaging.
Advanced cooling technologies such as microfluidic cooling can push the thermal limit up.
For instance, microfluidic cooling techniques for 3D processor-on-processor stacking can reduce core temperature by as much as 15$^{\circ}$C~\cite{microfluidic-cooling}.
For the purpose of this study, we consider a thermal limit of around 85--90$^{\circ}$C (358--363~K), assuming an air-cooled machine. Microfluidic cooling could enable the temperature bound to be raised to around 100--105$^{\circ}$C.
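
The kelvin values follow from $T[\mathrm{K}] = T[^{\circ}\mathrm{C}] + 273.15$, so 85--90$^{\circ}$C corresponds to 358.15--363.15~K, consistent with the 360~K limit in Table~\ref{tab:design-space}.
As a first-order illustration of what this limit implies, consider a lumped steady-state model, which is an assumption made here for exposition and not the thermal model used in our evaluation:
\[
T_{\mathrm{core}} = T_{\mathrm{amb}} + R_{\mathrm{th}}\, P_{\mathrm{total}} ,
\]
where $R_{\mathrm{th}}$ is a hypothetical junction-to-ambient thermal resistance and $P_{\mathrm{total}}$ is the power dissipated under the heat sink. With $T_{\mathrm{amb}} = 300$~K and a limit of 360~K, the power budget is
\[
P_{\mathrm{total}} \leq \frac{360~\mathrm{K} - 300~\mathrm{K}}{R_{\mathrm{th}}} = \frac{60~\mathrm{K}}{R_{\mathrm{th}}} .
\]
Under this sketch, stacking more active layers over the same footprint raises $P_{\mathrm{total}}$ against a fixed 60~K budget, which is why lower-power cores leave more room for additional layers.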
% or 373-378K.

%There are several parameters that can be tuned in the technology and microarchitecture space, under the thermal and yield constraints described.


\subsection{Variation in microarchitecture}
As part of our studies we evaluated the effect of microarchitecture
changes under the thermal constraint. A simpler and narrower-issue
processor will consume less power than a wider, more complex core. As
a result, there is more thermal slack available, which could enable a
higher frequency operation than the complex processor.  Thus it could
be possible to match the performance of the complex core using a
simpler core configuration, under similar thermal constraints.
However, preliminary experiments (not shown) indicated that the
performance loss from moving from out-of-order to in-order execution
outweighs power and thermal advantages for the space we
consider. Thus, we employ a low-issue out-of-order configuration as
our simpler core.  Our experiments were carried out on a 2-issue
Atom-like core configuration and a 4-issue Ivy Bridge-like
microarchitecture.
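
The break-even condition between the two cores can be written down under a simple assumption, namely that single-thread performance is the product of instructions per cycle and frequency, $\mathrm{Perf} = \mathrm{IPC} \times f$. The subscripts $s$ (simple) and $c$ (complex) are ours. The simple core matches the complex core when
\[
\mathrm{IPC}_{s}\, f_{s} \geq \mathrm{IPC}_{c}\, f_{c}
\quad\Longleftrightarrow\quad
\frac{f_{s}}{f_{c}} \geq \frac{\mathrm{IPC}_{c}}{\mathrm{IPC}_{s}} .
\]
As a purely hypothetical example, if the complex core achieved $1.5\times$ the IPC of the simple core on a given workload, the thermal slack of the simple core would have to permit a $1.5\times$ higher frequency for the two to break even.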

The ability to exploit greater microarchitectural complexity or an
increased number of cores depends on the application characteristics,
in particular the ILP and TLP of the application.  As discussed in
Section~\ref{sec:motivation}, there is a region in the parallelism versus
frequency plot that is not attainable because of yield and thermal
constraints.  Combining power-efficient TFET
technology with high-performance CMOS can expand the design
space. Using TFET cores in conjunction with 3D technology can reduce
the power consumed while maintaining processor yield, thus mitigating
the thermal constraint.  However, whether this extra design space
translates into a performance improvement depends entirely on the
application scaling behavior.

Figure~\ref{fig:building-collapse-result}a) and b) show two applications, \emph{barnes} and \emph{ocean.ncont}, respectively, that represent the two extremes of application scaling with core count.
The additional TFET cores operating at low frequency benefit highly parallel applications like \emph{barnes}, which can then improve its peak performance, as seen in Figure~\ref{fig:building-collapse-result}a).
On the other hand, an application like \emph{ocean.ncont}, shown in Figure~\ref{fig:building-collapse-result}b), which has limited TLP, prefers operating on fewer cores at higher frequency. 
%This mode of operation precludes a 3D stacking configuration as the thermal cost would be prohibitive at such high frequencies.
Thus, \emph{barnes} benefits greatly from the additional cores, whereas the effect is far less significant for \emph{ocean.ncont}.
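
One hedged way to reason about these two extremes is Amdahl's law extended with core frequency. This model is our illustrative assumption; the results in Figure~\ref{fig:building-collapse-result} come from simulation, not from this formula. For an application with parallel fraction $p$ running on $n$ cores at frequency $f$,
\[
\mathrm{Perf}(n, f) \propto \frac{f}{(1-p) + p/n} .
\]
For $p$ close to $1$, performance grows almost linearly with $n$, so many low-frequency cores can outperform a few fast ones. For small $p$, the denominator is dominated by $(1-p)$ and performance tracks $f$.
With hypothetical values, comparing 64 cores at 1~GHz against 16 cores at 2~GHz gives $1/(0.01 + 0.99/64) \approx 39.3$ versus $2/(0.01 + 0.99/16) \approx 27.8$ for $p = 0.99$, but $1/(0.5 + 0.5/64) \approx 1.97$ versus $2/(0.5 + 0.5/16) \approx 3.76$ for $p = 0.5$.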


\begin{figure}[ht!]
  \centering
    \epsfig{file=figs/building_collapse_result.eps, angle=0, width=1\linewidth, clip=}
    \caption{\label{fig:building-collapse-result} a) and b) Delineation of the design space attainable by CMOS (red), TFET (blue), both (green) and neither (black) cores to obtain peak performance, for a scalable (\emph{barnes}) and a non-scalable (\emph{ocean.ncont}) application, respectively. The best performance is seen in the TFET configuration for \emph{barnes} and in the CMOS configuration for \emph{ocean.ncont}.}
\end{figure}

